Gold-Standard Datasets for Annotation of Slovene Computer-Mediated Communication

Authors

  • Tomaz Erjavec
  • Jaka Cibej
  • Spela Arhar Holdt
  • Nikola Ljubesic
  • Darja Fiser
Abstract

This paper presents the first publicly available, manually annotated gold-standard datasets for the annotation of Slovene Computer-Mediated Communication. In this type of language, diacritics, punctuation and spaces are often omitted, and phonetic spelling and slang words are frequently used, which considerably degrades the performance of text processing tools trained on standard Slovene. Janes-Norm, which contains 7,816 texts or 184,766 tokens, is a gold-standard dataset for tokenisation, sentence segmentation and word normalisation, whereas Janes-Tag, comprising 2,958 texts or 75,276 tokens, was created for training and evaluating morphosyntactic tagging and lemmatisation tools for non-standard Slovene.

Similar resources

Fuzzy Neighbor Voting for Automatic Image Annotation

With the rapid development of digital images and the availability of imaging tools, massive amounts of images are created. Therefore, efficient management and suitable retrieval, especially by computers, is one of the most challenging fields in image processing. Automatic image annotation (AIA) refers to attaching words, keywords or comments to an image or to a selected part of it. In this paper,...

The goo300k corpus of historical Slovene

The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 – 1900. Each page of the transcription has an associated facsimile and the words in the texts have been manually annotated with their modern-day equivalent, lemma and part-of-speech. The paper presents the structure of t...

Learning Pragmatics through Computer-Mediated Communication in Taiwan

This study investigated the effectiveness of explicit pragmatic instruction on the acquisition of requests by college-level English as a Foreign Language (EFL) learners in Taiwan. The goal was to determine first whether the use of explicit pragmatic instruction had a positive effect on EFL learners’ pragmatic competence. Second, the relative effectiveness of presenting pragmatics through two deli...

IMPACT OF SYNCHRONOUS COMPUTER-MEDIATED COMMUNICATION ON EFL LEARNERS’ COLLABORATION: A QUANTITATIVE ANALYSIS

For the last two decades, computers have entered people’s lives in an unprecedented manner, to the point that almost everybody considers life without them nearly impossible. In recent years, researchers and educators have been trying to discover how computers and Internet technology can maximize the quality of language instruction. As such, the present experimental study sought to investigate th...

Gender and Computer-Mediated Communication: Emoticons in a Digital Forum in Persian

This study aimed to gain an insight into whether computer-mediated communication (CMC) in the form of a digital forum can reflect gendered discursive practices. A great deal of research has now established that computer-mediated interactions embody gendered differences in the use of emoticons, but few studies have examined the potential effect of the gender of the emoticon-receiver on the frequ...

Publication date: 2016